Remark
Please be aware that these lecture notes are accessible online in an ‘early access’ format. They are actively being developed, and certain sections will be further enriched to provide a comprehensive understanding of the subject matter.
4.6. Major Statistical Methods for Outlier Detection in Time Series#
4.6.1. Descriptive Statistics-Based Methods#
4.6.1.1. Z-Score Method#
Purpose: Measure how many standard deviations a point is from the mean.
Formula:

\[Z = \frac{x - \mu}{\sigma}\]
Outlier Criterion: \(|Z| > 3\)
Example 4.5
Daily temperatures with mean 72°F and standard deviation 8°F.
Reading of 98°F: \(Z = \frac{98-72}{8} = 3.25\) → Outlier
Reading of 85°F: \(Z = \frac{85-72}{8} = 1.625\) → Normal
Sensitive to existing outliers (outliers inflate \(\mu\) and \(\sigma\))
Assumes normal distribution
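The criterion can be sketched directly in NumPy (a minimal sketch; function and variable names are illustrative):

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Return z-scores and a boolean mask flagging |Z| > threshold."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)  # sample mean and standard deviation
    z = (x - mu) / sigma
    return z, np.abs(z) > threshold

# Hand check against Example 4.5, where mu = 72 and sigma = 8 are given:
print((98 - 72) / 8)   # 3.25 -> |Z| > 3, flagged
print((85 - 72) / 8)   # 1.625 -> normal
```

Note that when \(\mu\) and \(\sigma\) must be estimated from the same data, any outlier inflates both, which is exactly the sensitivity mentioned above.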
Fig. 4.23 Z-Score method applied to daily temperatures. Top panel shows the time series with detected outliers marked in red. Bottom panel shows z-score values, with points beyond ±3σ flagged.#
4.6.1.2. Modified Z-Score Method#
Purpose: Robust alternative using median and MAD (Median Absolute Deviation).
Formula:

\[M = \frac{0.6745\,(x_i - \tilde{x})}{MAD}\]

where \(\tilde{x}\) = median, \(MAD = \text{median}(|x_i - \tilde{x}|)\)
Outlier Criterion: \(|M| > 3.5\)
Example 4.6
Monthly sales with median $50,000 and MAD $8,000.

Sales of $85,000: \(M = \frac{0.6745(85000-50000)}{8000} = 2.95\) → Normal

Sales of $120,000: \(M = \frac{0.6745(120000-50000)}{8000} = 5.90\) → Outlier
Resistant to extreme values
Better for skewed distributions
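A minimal NumPy sketch of the same criterion (names are illustrative):

```python
import numpy as np

def modified_zscore_outliers(x, threshold=3.5):
    """Robust z-score using median and MAD; flags |M| > threshold."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    m = 0.6745 * (x - med) / mad
    return m, np.abs(m) > threshold

# Hand check against Example 4.6 (median $50,000, MAD $8,000):
print(0.6745 * (85_000 - 50_000) / 8_000)    # ~2.95 -> normal
print(0.6745 * (120_000 - 50_000) / 8_000)   # ~5.90 -> outlier
```

One practical caveat: MAD is zero when more than half the values are identical, so production code should guard against division by zero.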
Fig. 4.24 Modified Z-Score method applied to monthly sales data with seasonal patterns and skewness. This robust method uses median and MAD, making it less sensitive to existing outliers.#
4.6.1.3. Interquartile Range (IQR) Method#
Purpose: Range-based detection using the middle 50% of data.
Method:
Calculate \(IQR = Q3 - Q1\)
Define bounds: \([Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]\)
Outlier Criterion: Points outside the bounds
Example 4.7
Hourly website traffic with Q1 = 1,200 and Q3 = 1,800 visitors.
\(IQR = 1,800 - 1,200 = 600\)
Lower bound: \(1,200 - 1.5(600) = 300\)
Upper bound: \(1,800 + 1.5(600) = 2,700\)
Traffic of 3,500 → Outlier
Very robust against outliers
Works well with skewed distributions
Foundation of box plot outlier detection
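The two-step method above maps directly onto NumPy's percentile function (a minimal sketch; names are illustrative):

```python
import numpy as np

def iqr_bounds(x, k=1.5):
    """Tukey fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hand check against Example 4.7, where Q1 = 1,200 and Q3 = 1,800:
iqr = 1800 - 1200          # 600
lower = 1200 - 1.5 * iqr   # 300
upper = 1800 + 1.5 * iqr   # 2700
print(3500 > upper)        # True -> 3,500 visitors is an outlier
```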
Fig. 4.25 IQR method applied to hourly website traffic. Top panel shows time series with outlier bounds derived from Q1 and Q3. Bottom panel groups data by hour, revealing context-specific outliers.#
4.6.2. Time Series Modeling-Based Methods#
4.6.2.1. ARIMA-Based Detection#
Purpose: Model normal time series behavior with ARIMA using historical data, then detect outliers by comparing predictions against actual observations.
Method:
Split data into training (historical) and testing (recent) periods
Fit ARIMA(p,d,q) or SARIMA model on training data
Generate forecasts for testing period
Calculate forecast errors: \(\text{Error}_t = \text{Actual}_t - \text{Forecast}_t\)
Flag outliers where \(|\text{Error}| > k\sigma_{\text{error}}\) (typically \(k=3\))
Outlier Types Detected:
Additive Outliers (AO): Single-point spikes that return to normal
Level Shifts (LS): Permanent level changes after a specific time
Temporary Changes (TC): Gradual decay patterns with persistence parameter \(\delta\)
Note
ARIMA modeling is covered in detail in Chapter XX. This section focuses on using fitted ARIMA models for outlier detection through forecast comparison.
Example 4.8
Monthly airline passengers (2010–2022)
Training: 2010–2019 (120 months) to capture normal patterns
Testing: 2020–2022 (36 months) to detect anomalies
Model: SARIMA(0,1,1)(0,1,1)[12] captures trend and seasonality
Normal forecast error: ±150 passengers
March 2020 onward: errors fall below −2,000 passengers → Outliers (COVID-19 impact)
Realistic out-of-sample evaluation
Accounts for autocorrelation, trend, and seasonality
Clear separation between normal variation and true anomalies
Can classify outlier types through error pattern analysis
Requires sufficient historical data (minimum 2–3 seasonal cycles)
Model misspecification can produce false positives
Assumes future follows historical patterns (breaks during regime changes)
Forecast uncertainty increases with horizon length
Not suitable for data with frequent structural breaks
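The thresholding step (5) can be sketched independently of model fitting. In the snippet below the forecasts are taken as given and the numbers are made up to loosely echo Example 4.8; in practice the forecasts would come from a fitted seasonal ARIMA model (e.g., statsmodels' SARIMAX), and \(\sigma_{\text{error}}\) from its training residuals:

```python
import numpy as np

def flag_forecast_outliers(actual, forecast, k=3.0, sigma=None):
    """Flag points whose forecast error exceeds k*sigma.

    `sigma` should come from the training residuals; if None it is
    estimated from the test errors themselves (less reliable, since
    the anomalies then inflate the threshold)."""
    errors = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    if sigma is None:
        sigma = errors.std(ddof=1)
    return errors, np.abs(errors) > k * sigma

# Hypothetical passenger counts: training residual sd ~887, so the
# +/-3 sigma band is roughly +/-2,660 passengers.
actual   = [21000, 20500, 4800, 6100, 9000]    # steep drop from month 3 on
forecast = [20900, 20700, 21200, 21500, 21800]
errors, mask = flag_forecast_outliers(actual, forecast, k=3, sigma=887.25)
print(errors)  # large negative errors after the structural break
print(mask)    # the last three points are flagged
```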
| Metric | Value |
|---|---|
| model | SARIMA(0,1,1)(0,1,1)[12] |
| train_period | 2010-01 to 2019-12 |
| test_period | 2020-01 to 2022-12 |
| train_months | 120 |
| test_months | 36 |
| train_resid_std | 887.25 |
| test_error_std | 1318.63 |
| train_outliers | 2 |
| test_outliers | 3 |
| outlier_dates_test | [2020-03, 2020-04, 2020-05] |
The ARIMA model fits the training period well, but forecast errors in 2020–2022 become much larger, especially right after early 2020, which signals a structural break that the model was not built to anticipate. The first few months of the pandemic show large negative errors (actual far below forecast), then errors gradually move back toward zero as demand recovers.
The very large, short‑lived drops behave like a temporary change (TC) rather than a permanent level shift, because the series slowly returns toward its pre‑COVID trajectory. Earlier anomalies in the training period would appear as isolated spikes (AO) or step changes (LS), but in this example the main emphasis is on the regime change in the test window.
When this setup works well
There is a reasonably stable historical pattern (trend + seasonality) to learn from.
The goal is to spot periods where reality departs sharply from “business as usual” forecasts.
You can afford to periodically refit the model as new data arrive.
Main caveats
ARIMA assumes the future looks statistically similar to the past; large regime shifts (pandemics, policy shocks) break this assumption, so very large errors may be better interpreted as regime change than isolated outliers.
Forecast uncertainty grows with horizon, so very long test windows and a fixed ±3σ rule can either miss relevant anomalies or over‑flag normal volatility.
The 3σ threshold is heuristic; in practice it should be tuned using domain knowledge and tolerance for false alarms.
Fig. 4.26 Out-of-sample ARIMA-based outlier detection using a train–test split. Panel (a) shows training data plus test-period actual vs forecast and flagged test outliers; panel (b) shows corresponding forecast errors with a ±3σ band.#
4.6.2.2. STL Decomposition#
Purpose: Separate a time series into trend, seasonal, and residual components; detect outliers in the residuals.
Method:
Decompose: Series = Trend + Seasonal + Residual
Apply outlier detection to residual component only (e.g., \(|\text{Residual}| > 3\sigma\))
Large residuals indicate values that the trend and seasonality cannot explain
Outlier Criterion: \(|\text{Residual}| > 3\sigma_{\text{residual}}\)
Example 4.9
Monthly ice cream sales (2019–2024)
Strong summer peaks (seasonality) and gradual growth (trend)
January 2023: residual = +15,000 units (all other residuals within ±5,000) → Outlier (unusually warm weather?)
Handles complex, non-linear seasonality
Robust decomposition (outliers don’t distort trend/seasonal estimates)
Clear visual separation of components
After removing the trend and seasonal components, any remaining spike in the residuals is something the model cannot explain—this is your outlier. The January 2023 value looks unremarkable on the surface (it falls well within the range spanned by ordinary summer months), but the decomposition reveals it is unusually high for January, flagging it as an anomaly.
When to use this
Data has clear, regular seasonality (monthly, daily, hourly patterns).
You want to isolate structure (trend + season) separately from anomalies.
You can fit the series robustly (STL's `robust=True` option handles outliers well).
Main limitation
If the model cannot fit the trend or seasonality well (e.g., from bad parameter choices), residuals become noisy and threshold-based detection becomes unreliable.
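To make the residual-thresholding idea concrete, here is a deliberately simplified stand-in for STL (a linear trend plus per-month seasonal means instead of loess smoothing; all names and the synthetic data are illustrative). In practice one would use statsmodels' `STL(series, period=12, robust=True)` and threshold its `.resid` the same way:

```python
import numpy as np

def decompose_and_flag(x, period, k=3.0):
    """Toy additive decomposition, then flag residuals beyond k*sigma.
    STL estimates trend and seasonality with loess smoothing instead."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)         # crude linear trend
    detrended = x - (slope * t + intercept)
    # Seasonal component: mean of each position within the cycle
    seasonal = np.array([detrended[i::period].mean() for i in range(period)])
    resid = detrended - seasonal[t % period]
    return resid, np.abs(resid) > k * resid.std(ddof=1)

# Synthetic monthly sales: growth + summer peak + one anomalous month
rng = np.random.default_rng(0)
m = np.arange(72)
sales = 100 + 0.5 * m + 20 * np.sin(2 * np.pi * m / 12) + rng.normal(0, 2, 72)
sales[48] += 30          # an unusually high month, as in Example 4.9
resid, mask = decompose_and_flag(sales, period=12)
print(np.flatnonzero(mask))   # the injected anomaly at index 48 stands out
```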
Fig. 4.27 STL decomposition of monthly ice cream sales. Panels show observed data, trend, seasonal pattern, and residuals. Outliers appear as spikes in the residual panel beyond ±3σ bands.#
4.6.2.3. Change Point Detection#
Purpose: Identify sudden shifts in the statistical properties of a time series (mean, variance, or trend).
Method: Use CUSUM (Cumulative Sum Control Chart) to detect when a series mean shifts. Flag points near the detected change as potential anomalies.
Example 4.10
Server response times
Baseline: 200 ms with low variance (stable operation)
Day 45 onward: mean jumps to ~260 ms, variance increases
Days 44–46 show spikes >1000 ms → system-level issue, not isolated outliers
Large errors clustered around a change point often indicate a regime shift rather than scattered anomalies.
The CUSUM chart shows the cumulative deviation from a baseline level. When it crosses the threshold (red line), it signals a sustained shift in the mean—not an isolated spike, but a regime change. The large response times around day 45 are symptoms of the underlying system issue, not separate outliers to flag independently.
When to use this
Monitoring systems that can undergo sudden shifts (operational changes, hardware failures, traffic surges).
You want to distinguish between normal variability and structural breaks.
Interest is in identifying when things changed, not just individual anomalies.
Limitation
CUSUM works well for detecting mean shifts but not sudden changes in variance or more complex breaks. For those, use dedicated change point detection libraries (e.g., `ruptures`, which implements algorithms such as PELT).
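A minimal one-sided CUSUM sketch on synthetic data loosely echoing Example 4.10 (the drift and threshold values are illustrative and would be tuned to the baseline noise in practice):

```python
import numpy as np

def cusum_upper(x, target, drift, threshold):
    """One-sided CUSUM for an upward mean shift:
    S_t = max(0, S_{t-1} + x_t - target - drift); alarm when S_t > threshold.
    `drift` (often 0.5 * baseline sigma) desensitizes S_t to small wander."""
    s = np.zeros(len(x))
    alarm = None
    for t, xt in enumerate(np.asarray(x, dtype=float)):
        s[t] = max(0.0, (s[t - 1] if t else 0.0) + xt - target - drift)
        if alarm is None and s[t] > threshold:
            alarm = t
    return s, alarm

# Response times: ~200 ms baseline, mean jumps to ~260 ms at index 45
rng = np.random.default_rng(1)
latency = np.concatenate([rng.normal(200, 5, 45), rng.normal(260, 10, 15)])
s, alarm = cusum_upper(latency, target=200, drift=2.5, threshold=30)
print(alarm)   # alarm fires at or just after the shift
```

A symmetric downward-shift statistic (flipping the sign of the increment) is usually run alongside this one to catch drops as well as jumps.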
Fig. 4.28 Change point detection using CUSUM. Panel 1 shows response times with a mean shift around day 45 and associated spikes. Panel 2 shows the CUSUM statistic triggering an alarm near the shift.#